SECTION I
INTRODUCTION 

A  major  problem  in  the  Fleet  Replacement  Squadron  (FRS)  is  determining 
the  appropriate  amount  of  in-flight  training  that  should  be  given  a  pilot 
trainee  to  meet  the  objectives  of  the  FRS  in  serving  the  needs  of  fleet 
squadrons. The difficulty centers on determining performance requirements
and assessing skill levels appropriate for the FRS graduate. This inability
to achieve precise determination of pilot performance at given times
commensurate with squadron goals has hampered the management of training,
particularly the scheduling of replacement pilot training (in terms of both
effectiveness and efficiency).

These  obstacles  are  of  most  concern  in  the  training  of  first-tour  pilots 
(recent  graduates  of  Undergraduate  Pilot  Training  who  are  entering  their 
first operational aircraft type). These students must acquire many skills
and learn to organize much information as minima in the very short time
period  available  in  order  to  graduate  to  a  fleet  assignment.  Extending 
training  beyond  assured  proficiency  is  expensive  in  resource  use;  training  to 
less  than  required  proficiency  incurs  significant  risks. 

Improving precision in judging pilot proficiency and enhancing management
ability in prescribing training sensitive to individual differences in
student performance and instructor evaluation continue to be prime
requirements in military flight training.

This  report  proposes  a  method  for  achieving  improvements  in  the  precision 
of proficiency judgments and in determining student proficiency. This
proposed solution, identified as the Computer Aided Training Evaluation and
Scheduling (CATES) system, provides a computer managed, prescriptive training
program based on individual student performance. In essence, the CATES
system does the following:

•  clearly  defines  the  level  of  skills  required  of  the  FRS  graduate 

•  adds precision to instructor pilot judgments by providing a more
clearly defined comparison standard

•  increases reliability of instructor pilot judgments by grading each
task execution rather than using the instructor pilot's "subjective
average" of all task executions

•  lists tasks for individual students indicating one of the following
decisions:

..  desired  proficiency  attained 

..  proficiency  below  acceptable  limits,  or 

..  proficiency  undetermined,  continue  training/practice. 

•  provides  an  acceptable  and  workable  performance  assessment  schema 
for use in a Computer Managed Instruction (CMI) system.

The report describes the problems encountered in attempts to determine
proficient task performance of students and the conceptual development of
the CATES system as a method that may be used in making proficiency
determinations. An effort is in progress to test the operational feasibility
of the CATES system and to evaluate the validity of proficiency
determinations made by the system. Results of this effort will be presented
in a future report.

ORGANIZATION  OF  THE  REPORT 

In  addition  to  this  introduction,  three  sections  and  one  appendix 
are  presented.  Section  II  presents  the  method  designed  to  strengthen 
grading criteria. Although the criteria continue to be based on subjective
judgments of instructor pilots, clarifying the tasks to be measured and
providing a standard on which to base subjective judgments should give the
criteria greater precision.

Section  III  presents  the  method  for  formalizing  and  quantifying  the 
parameters  of  the  proficiency  determination  process. 

Section  IV  presents  preimplementation  considerations  of  the  CATES 
system  and  its  applicability  at  a  specific  FRS.  Issues  to  be  tested  as  well 
as  future  implications  of  the  CATES  system  are  discussed. 

The  appendix  provides  a  mathematical  discussion  of  the  Wald  Binomial 
Probability Ratio Test.

SECTION II

IMPROVEMENT  OF  GRADING  PROCEDURES 


CURRENT  PRACTICE 

Determination  of  the  proficient  performance  of  aircraft  flying  tasks 
continues  to  be  a  subjective  judgment  made  by  instructor  pilots.  Current 
practice  in  training  squadrons  consists  of  "flights"  during  which  a  subset  of 
tasks  from  the  training  syllabus  are  performed  a  varying  number  of  times  by 
the  pilot  trainee  at  the  discretion  of  the  instructor  pilots.  During  or 
shortly  after  each  flight,  the  instructor  pilot  "grades"  the  pilot  trainee  on 
the  tasks  performed  using  a  standard  scale  but  also  employing  his  own  personal 
criteria. While instructors differ in their personal rating bias (hard-easy),
they attempt to grade in terms of "average performance at this stage
of  training."  It  is  usual  for  the  pilot  trainee  to  be  exposed  to  several 
different  instructor  pilots.  After  a  specified  minimum  number  of  flights, 
and  a  recommendation  by  an  instructor  pilot,  the  pilot  trainee  is  scheduled 
for  a  final  "check  flight."  His  performance  on  selected  tasks  is  graded  by 
an  instructor  pilot  acting  in  the  independent  role  of  "check  pilot."  Should 
the  pilot  trainee  not  perform  the  flight  consonant  with  the  standards  of 
performance  expected  of  him  by  the  "check  pilot,"  he  is  rescheduled  for 
additional  "check  flights"  until  he  is  deemed  proficient. 

Student  exposure  to  training  tasks  can  be  variable  due  to  instructor 
differences  and  varying  performance  standards.  In  addition,  each  individual 
pilot  trainee  exhibits  variability  in  successive  performances  on  complex 
procedural and psychomotor tasks. This variability of skilled task
performance has been well documented (Fitts and Posner, 1968). Further
compounding this problem of inconsistent performance, the pilot trainee is
transitioning from a level of performance well below the required level to a
required standard of performance. This transition reflects different learning
rates by the individual pilot trainees. Learning rates are also highly
variable within and between individuals (Sidman, 1960). Clearly, asymptotic
performance commensurate with desired performance standards is difficult to
ascertain using the current practice.

PROFICIENCY  GRADING  SYSTEM 

In  a  series  of  studies  conducted  by  the  Training  Analysis  and  Evaluation 
Group  (TAEG)  to  determine  the  effectiveness  of  Device  2F87F  (P-3  Operational 
Flight  Trainer)  in  the  FRS,  the  inadequacies  of  current  grading  procedures 
were recognized (Browning, Ryan, Scott, and Smode, 1977; Browning, Ryan, and
Scott, 1978). To overcome these inadequacies, the TAEG instituted a
"proficiency grading system." The system provided a clearer picture of the
trainee's flight task performance in both simulator and aircraft training.
The proficiency grading system still required a subjective judgment by
instructor and check pilots. However, the instructors graded task performance
against a precise standard: "P was defined as performance estimated to be
equivalent to that required to demonstrate competence in that task on the
conventional FLY 6 check" (Browning, et al., 1977, p. 20). This standard
focuses on the required terminal level of performance; i.e., the objective
of training.

Actual grading of performance was accomplished using a dichotomous scale.
Task performance that met or exceeded the standard was recorded as "P"; task
performance that did not meet the standard was recorded as "1." The
proficiency grading introduced by the TAEG had a further requirement.
Performance was graded each time the task was performed, and this series of
graded trials was recorded and kept in the sequence of presentation. The
procedure of grading each task trial as it was performed eliminated the
requirement for the instructor to make a summary judgment of task proficiency
based on pilot trainee performance of successive task trials during a flight.

The advantages of a proficiency grading system for increasing the precision
of performance judgments have been incorporated in the CATES system.
The performance standard used in the CATES system is defined as task
performance estimated to be equivalent to that required to earn an adjective
rating of "Qualified" and/or a numerical score of 4 on the Naval Air Training
and Operating Procedures Standardization (NATOPS) Program flight evaluation.
The CATES system uses the same proficiency grading procedure as discussed
previously. Although the grading procedure increases the precision, it does
not reduce several sources of variability in trainee performance; e.g., task
difficulty and learning rates.

The  proficiency  grading  procedure  results  in  a  task  performance  or 
training  protocol  for  each  task.  Two  hypothetical  trainee  records  (protocols 
from  the  same  trainee)  are  shown  in  table  1. 


TABLE 1. HYPOTHETICAL TASK PERFORMANCE OF ONE TRAINEE
FOR TWO DIFFERENT TASKS

Task      Training Protocol

Task A    11P1P1PPP1PP
Task B    1PPPPPPPPPPP

It could be inferred that "Task A" is more difficult than "Task B" or it
could be inferred that the trainee is more proficient on "Task B" than
"Task A."


Table  2  contains  examples  of  trainee  task  performance  protocols  for  two 
different  kinds  of  tasks  and  hypothetical  task  protocols  for  a  trained  pilot. 
The  pilot  trainees  exhibit  different  protocols  initially  (more  "1's"  than 
"P's")  but  the  variability  eventually  will  diminish.  Learning  rates  differ 
among  tasks  as  shown  by  comparing  Task  A  with  Task  B.  During  later  flights/ 
sessions  the  protocols  for  the  pilot  trainee  are  not  readily  distinguishable 
from those of a trained pilot. A procedural problem remains in determining
when task performance protocols for trainees match the protocols of trained
pilots.


TABLE 2. COMPARISON OF HYPOTHETICAL TASK PERFORMANCE PROTOCOLS FOR
TWO DIFFERENT TASKS AND TWO LEVELS OF AVIATOR PROFICIENCY

                 Training Protocol During Flights/Sessions
Task/Aviator     One    Two    Three   Four   Five   Six

TASK A
  Pilot Trainee  111    P11    1P1     1PP    PPP    PP
  Trained Pilot  PPP1   PPP    1PP     P      P1     PPP

TASK B
  Pilot Trainee  11     1P     P       PPP    PP     PP
  Trained Pilot  PP     PPP    P1      PP     P      PP

The  essence  of  the  problem  lies  in  assessing,  with  a  specified  degree  of 
confidence,  the  point  at  which  proficiency  has  been  obtained. 

Several  ways  to  deal  with  the  problem  were  explored.  Two  approaches 
were  found  in  previous  research  concerned  with  proficiency  assessment.  The 
first  approach  was  to  arbitrarily  define  the  point  at  which  proficiency  was 
attained  by  the  following  rule: 

(1)  over  50  percent  of  the  trials  (for  a  given 
task)  on  any  flight  had  to  be  "P"  and  (2)  at 
least  50  percent  of  the  trials  were  P  on  all 
subsequent  flights  (Browning,  et  al.,  1978, 
p.  23). 
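
A minimal sketch of how this first rule could be applied to trial-by-trial
grades, assuming each flight is recorded as a string of "P" and "1" grades.
The function name and the sample protocols are illustrative and are not taken
from Browning, et al. (1978).

```python
def flight_of_proficiency(flights):
    """Return the 1-based flight at which the 50 percent rule is first met.

    `flights` is a list of strings, one per flight, each holding the trial
    grades for one task: "P" (met the standard) or "1" (did not).
    Returns None if the rule is never satisfied.
    """
    def p_fraction(trials):
        return trials.count("P") / len(trials)

    graded = [f for f in flights]
    for i, trials in enumerate(graded):
        if not trials:
            continue  # task not flown on this flight
        later = [f for f in graded[i + 1:] if f]
        # (1) over 50 percent "P" on this flight, and
        # (2) at least 50 percent "P" on every subsequent flight
        if p_fraction(trials) > 0.5 and all(p_fraction(f) >= 0.5 for f in later):
            return i + 1
    return None


# Hypothetical protocols for one trainee on two tasks, six flights each.
task_a = ["111", "P11", "1P1", "1PP", "PPP", "PP"]
task_b = ["11", "1P", "P", "PPP", "PP", "PP"]
print(flight_of_proficiency(task_a))  # 4
print(flight_of_proficiency(task_b))  # 3
```

Note that condition (2) can only be checked once all later flights are
known, so the rule cannot be evaluated while training is still under way.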

The second approach was used in the evaluation of the Initial Entry Rotary
Wing Flight Training Program by the Army (USAAVNC Evaluation Team, 1979).
The tasks were graded by daily performance rather than by individual trials;
however, the approach used to determine proficiency could also be incorporated
with graded trials.

The  point  of  principal  concern  was  the  training 
day  on  which  the  student  achieved  proficiency 
on each maneuver. Achievement of maneuver
proficiency was defined as that training day on
which  the  third  successive  (+)  grade  on  the 
maneuver  was  given  the  student.  That  is,  the 
student  was  required  to  perform  a  maneuver  in 
accord  with  established  USAAVNC  standards  on 
three  successive  occasions  before  he  was  judged 
to  be  proficient  on  that  maneuver  (USAAVNC 
Evaluation  Team,  1979,  p.  21). 
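
The three-in-a-row rule quoted above can be sketched as follows. This is an
illustration only; it assumes one grade per training day, recorded as "+"
for performance in accord with standards and "-" otherwise, and the function
name is hypothetical.

```python
def day_of_third_successive_plus(daily_grades):
    """Return the 1-based training day of the third successive "+" grade.

    `daily_grades` is a sequence of "+" / "-" grades for one maneuver,
    one per training day. Returns None if three in a row never occur.
    """
    run = 0
    for day, grade in enumerate(daily_grades, start=1):
        run = run + 1 if grade == "+" else 0  # any "-" resets the run
        if run == 3:
            return day
    return None


print(day_of_third_successive_plus("-+-+++"))  # 6
print(day_of_third_successive_plus("++-+"))    # None
```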

While both of the above approaches are logical, objective, and expedient,
they  are  faulty.  Both  require  training  protocols  that  include  initial  and 
final  levels  of  proficiency  to  make  accurate  performance  determinations.  In 
other  words,  they  are  "after  the  fact"  rather  than  predictive.  Another  flaw 
is  that  an  arbitrary  number  of  "P"  trials  is  not  realistic  across  all  tasks 
due  to  differences  in  task  difficulty.  In  addition,  these  approaches  may  not 
accommodate  situations  where  only  a  small  number  of  training  trials  are  given 
or  where  there  are  wide  differences  in  learning  rates  of  trainees.  Finally, 
the  instructor's  judgment  may  be  biased  if  he  has  knowledge  of  an  arbitrary 
decision  rule. 

SEQUENTIAL  METHOD.  Both  of  the  above  approaches  require  a  sample  of  trials 
of trainee performance before the rule can be applied. An alternative approach
would be to examine trials taken one at a time and accumulate the information
for input into the decision model (Hoel, 1971). Using this approach, one
would  expect  to  be  in  a  better  position  to  make  decisions  than  if  no  attempt 
were  made  to  look  at  the  data  until  a  sample  of  fixed  size  had  been  taken. 

There  are  methods  available,  using  sequential  sampling  techniques  and  a 
statistical  decision  model,  that  operate  on  this  accumulation  of  information 
basis  and  that  require  considerably  less  sampling  on  the  average  than  the 
fixed-size sample methods. The statistical decision model is limited to two
choices in decision making (three choices if one considers deferring a
decision as a decision). This limitation is not troublesome when applied to
proficiency determination. The decisions of primary concern are simply: Is
the trainee proficient? or, alternatively, Is the trainee not proficient?
Additional advantages are: (1) decisions are reached based on a minimum
number of trials and (2) decisions are made with an established level of
confidence.

SECTION III
CATES  DECISION  MODEL 

One  sequential  method  that  may  be  used  as  a  means  for  making  statistical 
decisions  with  a  minimum  sample  was  introduced  by  Wald  (1947).  Probability 
ratio  tests  and  corresponding  sequential  procedures  were  developed  for  several 
statistical  distributions.  One  of  the  tests,  the  binomial  probability  ratio 
test,  was  formulated  in  the  context  of  a  sampling  procedure  to  determine 
whether  a  collection  of  a  manufactured  product  should  be  rejected  because  the 
proportion of defectives is too high or should be accepted because the
proportion of defectives is below an acceptable level. The sequential testing
procedure also provides for a postponement of decisions concerning acceptance
or rejection. This deferred decision is based on prescribed values of alpha
(α) and beta (β). Alpha (α) limits errors of declaring something "True" when
it is "False" (Type I error). Beta (β) limits errors of declaring something
"False" when it is "True" (Type II error).

In an industrial quality control setting, the inspector needs a chart
similar to figure 1 to perform a sequential test to determine if a
manufacturing process has turned out a lot with too many defective items or
whether the proportion of defects is acceptable. As each item is observed,
the inspector plots a point on the chart one unit to the right if it is not
defective, one unit to the right and one unit up if the item is defective.
If the plotted line crosses the upper parallel line, the inspector will reject
the production lot. If the plotted line crosses the lower parallel line, the
lot will be accepted. If the plotted line remains between the two parallel
lines of the sequential decision chart, another sample item will be drawn
and observed/tested.

This sequential sampling procedure decision model has been previously
used in educational and training settings. Ferguson (1969) used the sequential
test to determine whether individual students should be advanced or given
remedial assistance after they completed learning modules of instruction.
Similarly, Kalisch (1980) employed the sequential test for an Air Force
Weapons Mechanics Training Course (63ABR46320) conducted at Lowry Air Force
Base, Colorado. Results from both applications of sequential testing
indicate greater efficiency than for tests composed of fixed numbers of items.
It appears sequential testing may substantially reduce testing time.

The  CATES  system  decision  model  uses  sequential  testing  similar  to  those 
applications  previously  cited.  The  decision  model  focuses  on  proportions  of 
proficient  trials  (analogous  to  nondefectives  or  correct  responses)  whereas, 
in  previous  applications,  proportions  of  defectives  or  incorrect  responses 
were  the  items  of  interest.  This  approach  does  not  alter  the  logic  of  the 
sequential sampling procedure or the decision model. It does enhance the
"meaningfulness" of the procedure in decisions concerning proficiency because
the  ultimate  goal  is  to  determine  "proficiency"  rather  than  "nonproficiency." 

It  should  be  noted  that  in  the  industrial  quality  control  setting,  sampling 
occurs  after  the  manufacturing  process.  In  the  educational  and  training 
applications cited above (Ferguson, 1969 and Kalisch, 1980), sequential
sampling occurred after the learning period. In the CATES system, the
sequential sampling occurs during the learning period and eventually
terminates it.


Figure  1.  Hypothetical  Sequential  Sampling  Chart 


TAEG Report No. 94


CATES  SYSTEM  MODEL  PARAMETERS 

The  decision  model  can  be  described  as  consisting  of  decision  boundaries. 
Referring  to  figure  1,  the  parallel  lines  represent  those  decision  boundaries. 
Crossing  the  upper  line,  or  boundary,  results  in  a  decision  to  "Reject  Lot"; 
crossing  the  lower  line,  or  boundary,  results  in  a  decision  to  "Accept  Lot." 

In  the  CATES  system,  these  decision  boundaries  translate  to  "Proficient"  and 
"Not  Proficient."  Calculations  of  the  decision  boundaries  require  four 
parameters.  These  four  parameters  are: 

P1         Lowest acceptable proportion of proficient trials (P) required
           to pass the NATOPS flight evaluation with a grade of
           "Qualified." Passage of the NATOPS flight evaluation is required
           to be considered a trained aviator in an operational (fleet)
           squadron.

P2         Acceptable proportion of proficient trials (P) that represent
           desirable performance on the NATOPS flight evaluation.

Alpha (α)  The probability of making a TYPE I decision error (deciding a
           student is proficient when in fact he is not proficient).

Beta (β)   The probability of making a TYPE II decision error (deciding
           a student is not proficient when in fact he is proficient).

Parameter setting is a crucial element in the development of the
sequential sampling decision model. Kalisch (1980) outlines three methods
for selecting proficient/not proficient performance (q0/q1 values) as:

Method 1--External Criterion. Individuals are
classified as masters, non-masters, or unknown
on the basis of performance on criteria directly
related to the instructional objectives. These
criteria can be in terms of demonstrated levels
of proficiency either on the job or in a training
environment. The mean proportion of items
answered correctly by the masters on an objective
would provide an estimate for q0. Similarly,
q1 would be the proportion correct for the
non-masters.

Method 2--Rationalization. Experts in the subject
area who understand the relation of the training
objectives to the end result; e.g., on-the-job
performance, select the q0 and q1 values to
reflect their estimation of the necessary levels
of performance. This method is probably the
closest to that now used by the Air Force. The
procedure may provide somewhat easier decision
making since specifying two values creates an
indecision zone--neither mastery nor non-mastery.
This indecision zone indicates that performance
is at a level which may not be mastery but is
not sufficiently poor to be considered at a
non-mastery level.

Method 3--Representative Sample. The scores of
prior trainees, who demonstrate the entire range
from extremely poor to exemplary performance
on objectives, are used to estimate q0 and q1.
The proportion correct for the entire sample is
used to obtain an initial cutting score C. Scores
are separated into two categories: (a) those
scores greater than or equal to C and (b) those
less than C. For each category, the mean
proportion correct score is computed. The mean for
the first category equals q0; the mean for the
second category equals q1.
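
As a rough illustration of Method 3, the sketch below estimates the mastery
and non-mastery proportions from a sample of prior scores. It assumes every
trainee answered the same number of items, so that the proportion correct
for the entire sample equals the mean of the individual proportions; the
function name and the sample scores are hypothetical.

```python
from statistics import mean


def method3_estimates(proportions):
    """Split scores at the overall proportion correct and average each side.

    `proportions` holds one proportion-correct score per prior trainee.
    Returns (C, mastery_mean, non_mastery_mean), where C is the initial
    cutting score, mastery_mean is the mean of scores >= C and
    non_mastery_mean is the mean of scores < C.
    """
    cut = mean(proportions)                      # initial cutting score C
    high = [p for p in proportions if p >= cut]  # category (a)
    low = [p for p in proportions if p < cut]    # category (b)
    if not low:
        raise ValueError("all scores are equal; two categories cannot be formed")
    return cut, mean(high), mean(low)


# Made-up scores spanning poor to exemplary performance.
cut, q_mastery, q_non_mastery = method3_estimates([0.30, 0.45, 0.70, 0.85, 0.95])
print(round(cut, 3), round(q_mastery, 3), round(q_non_mastery, 3))
```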


Selection of values for P1 and P2 (P1 = q1 and P2 = q0 in Kalisch, 1980) for
the CATES decision model incorporated Method 1 for setting of P1 and Method 3
for setting of P2.


The value selected for P1 was based on the lowest proportion of P grades
(numerical grade of 4.0 on the NATOPS flight evaluation) that may be given
and still result in an overall rating of "Qualified." The NATOPS evaluation
flight consists of a number of flight tasks grouped in areas and subareas.
As the tasks or subareas are performed, the pilot's performance is graded
using a numerical score. Three numerical scores may be awarded: Qualified
performance is assigned a "4," Conditionally Qualified performance is assigned
a "2," and Unqualified performance is assigned a "0." The numerical scores
are averaged across all tasks and subareas to yield an overall numerical
score. To receive an overall rating of "Qualified," the average of all tasks
or subareas must fall within the range of 3.00 to 4.00. Thus, the criteria
for passing the NATOPS flight evaluation with a "Qualified" rating require
that at least 50 percent of the tasks be graded as "Qualified." Therefore,
the lower limit of proficient performance was set at .50 for all tasks.
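
The 50 percent figure can be checked with a short derivation. The symbols
x, y, and z are introduced here only for illustration: they denote the
fractions of tasks scored 4, 2, and 0, respectively.

```latex
\begin{aligned}
\bar{s} &= 4x + 2y + 0z \ge 3.00, \qquad x + y + z = 1,\\
4x + 2(1 - x) &\ge 4x + 2y \ge 3.00 \quad (\text{since } y \le 1 - x),\\
2x + 2 &\ge 3.00 \;\Longrightarrow\; x \ge 0.50 .
\end{aligned}
```

The bound is reached only when every remaining task is scored 2; any task
scored 0 raises the proportion of 4's needed for a "Qualified" average.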


The value selected for P2 was determined by examining performance scores
of a sample of 49 Naval Aviators' NATOPS flight evaluations given at
Helicopter Antisubmarine Squadron (HS-1), Naval Air Station (NAS)
Jacksonville, Florida. The sample was restricted to only those aviators
rated as "Qualified," thus representing exemplary performance. This
examination revealed the proportion of "Qualified" scores for each subarea
and/or flight task. This proportion is directly translated to P2 values for
each task in the training syllabus.

The selection of alpha (α) and beta (β) should be based on the criticality
of accurate proficiency decisions. Small values of alpha (α) and beta (β)
require additional task trials to make decisions with greater confidence.
Factors that are important in selecting values for alpha (α) and beta (β) are
outlined below:

1.  Alpha (α) values

a.  Safety--potential harm to the trainee or to others
due to the trainee's actual non-mastery of the task.

b.  Prerequisite in Instruction--potential problems
in future instruction, especially if the task is
prerequisite to other tasks.

c.  Time/Cost--potential loss or destruction of equipment
either in training or upon fleet assignment.

d.  Trainee's View of the Training--potential negative
view by trainee when classified as proficient although
the trainee lacks confidence in that decision. Also,
after fleet assignment, if previous training has not
prepared him sufficiently, the trainee may also have a
negative view of the training program.

2.  Beta (β) values

a.  Instruction--requirement for additional training
resources (personnel and materials) for unnecessary
training in case of misclassification as not proficient.

b.  Trainee Attitudes--the attitude of trainees when tasks
have been mastered yet training continues; trainee
frustration; corresponding impact on performance in the
remainder of the training program and fleet assignment.

c.  Cost/Time--the additional cost and time required
for additional training that is not really needed.

Alpha (α) and beta (β) values used in the CATES decision model were
arbitrarily selected as .10. A confidence level of 90 percent in decisions
made  by  the  model  appears  reasonable  when  the  previously  discussed  factors 
are  considered.  As  rigorous  field  testing  of  the  model  is  conducted,  these 
parameters  may  be  modified  as  indicated  by  empirical  evidence  and  command 
policy.  At  present,  values  of  .10  appear  quite  reasonable. 

After the model parameters have been selected, calculation of the
decision  boundaries  may  be  accomplished  using  the  Wald  Binomial  Probability 
Ratio  Test.  The  appendix  provides  a  formal  mathematical  discussion  of  this 
test. 
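
For orientation, the boundaries in the usual textbook form of Wald's
binomial test are sketched below. This is the standard formulation, not a
transcription of the appendix, whose notation may differ. The symbols n
(trials observed so far), d_n (number of "P" trials among them), g, s, h_1,
and h_2 are introduced here for illustration.

```latex
\begin{aligned}
\lambda_n &= \left(\frac{P_2}{P_1}\right)^{d_n}
             \left(\frac{1 - P_2}{1 - P_1}\right)^{n - d_n},
\qquad
\frac{\beta}{1 - \alpha} < \lambda_n < \frac{1 - \beta}{\alpha}
\;\Rightarrow\; \text{continue training},\\[4pt]
g &= \ln\frac{P_2\,(1 - P_1)}{P_1\,(1 - P_2)}, \qquad
s = \frac{1}{g}\ln\frac{1 - P_1}{1 - P_2}, \qquad
h_2 = \frac{1}{g}\ln\frac{1 - \beta}{\alpha}, \qquad
h_1 = \frac{1}{g}\ln\frac{1 - \alpha}{\beta},\\[4pt]
d_n &\ge h_2 + s\,n \;\Rightarrow\; \text{Proficient}, \qquad
d_n \le -h_1 + s\,n \;\Rightarrow\; \text{Not Proficient}.
\end{aligned}
```

Under this form the two boundaries are parallel lines with common slope s,
separated vertically by h_1 + h_2.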


To  illustrate  the  differences  in  task  difficulty,  two  tasks  were  selected 
from  the  HS-1  training  syllabus,  and  the  decision  models  for  these  tasks 
were  calculated.  To  further  show  how  the  decision  models  serve  to  aid  in 
making proficiency decisions, task protocols of a pilot trainee are imposed
on the model.¹ Figure 2 shows the model for the task "Running Takeoff,"
and figure 3 shows the model for the task "Free Stream Recovery."

¹ Actual trial data for a pilot trainee undergoing training at HS-1, NAS
Jacksonville, FL.

Figure 2. Sequential Sampling Decision Model for Running Takeoff Task

Figure 3. Sequential Sampling Decision Model for Free Stream Recovery Task

Empirical data reflect a relative difference in task difficulty. The
sample of NATOPS evaluation scores indicates the proportion of "Qualified"
scores on the Running Takeoff task was .92, while the proportion of
"Qualified" scores on the Free Stream Recovery task was .77. This relative
difference in task difficulty is represented in the model as differences
between the slopes and the widths between the parallel lines of the two
models. In the case of the Free Stream Recovery task (figure 3), the slopes
are less steep (indicating more trials to reach proficiency) and the parallel
lines are farther apart (indicating there will typically be more uncertainty
about individual trials before a decision can be reached).
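
The comparison can be reproduced approximately with the sketch below. It
assumes the standard Wald boundary lines plotted as cumulative "P" trials
against trials given, with the lower proportion at .50 and both error
probabilities at .10; the exact construction of figures 2 and 3 may differ,
and the function name is hypothetical.

```python
import math


def boundary_lines(p1, p2, alpha, beta):
    """Return (slope, lower_intercept, upper_intercept) of the decision lines.

    Lines are expressed as: number of "P" trials = intercept + slope * trials.
    Crossing the upper line gives "Proficient"; crossing the lower line
    gives "Not Proficient".
    """
    g = math.log(p2 * (1 - p1) / (p1 * (1 - p2)))
    slope = math.log((1 - p1) / (1 - p2)) / g
    upper = math.log((1 - beta) / alpha) / g
    lower = -math.log((1 - alpha) / beta) / g
    return slope, lower, upper


for task, p2 in [("Running Takeoff", 0.92), ("Free Stream Recovery", 0.77)]:
    slope, lower, upper = boundary_lines(0.50, p2, 0.10, 0.10)
    print(f"{task}: slope={slope:.2f}, gap between lines={upper - lower:.2f}")
```

Under these assumptions the Free Stream Recovery lines come out with the
smaller slope and the wider gap, in line with the description above.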

In these examples, the probability of making decision errors (both type
I and type II) as indicated earlier was set at .10 for both tasks. If this
level of confidence were increased (lower values of alpha (α) and beta (β)),
the region of uncertainty would also increase. The overall result is that
more trials are required to make a decision with increased confidence.
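
One way to see this effect is to count the fewest consecutive "P" trials
that could produce a "Proficient" decision from a fresh start. The sketch
assumes the standard Wald log-likelihood form of the test; the function name
is hypothetical, and the .05 and .01 settings are examples only, not values
proposed in this report.

```python
import math


def min_consecutive_p(p1, p2, alpha, beta):
    """Fewest consecutive "P" trials needed to cross the Proficient boundary."""
    threshold = math.log((1 - beta) / alpha)  # upper boundary, log scale
    gain_per_p = math.log(p2 / p1)            # evidence added by one "P"
    return math.ceil(threshold / gain_per_p)


# Running Takeoff parameters (P1 = .50, P2 = .92) at three confidence levels.
for err in (0.10, 0.05, 0.01):
    print(err, min_consecutive_p(0.50, 0.92, err, err))
# 0.1 4
# 0.05 5
# 0.01 8
```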

Both  models,  then,  reflect  rather  well  the  true  state  of  affairs  between 
different tasks and their impact on a rational decision process. The
differences in task difficulty relate directly to differences in the model
parameters.

Figures  2  and  3  also  show  the  decisions  reached  by  the  model  on  student 
performance.  The  student  received  a  total  of  eight  trials  on  the  Running 
Takeoff  task  during  the  training  program.  The  sequence  of  graded  trials  and 
the  graphical  plots  of  the  sequence  are  shown  in  figure  2.  The  first  two 
trials  were  judged  to  be  below  the  standard  of  performance.  On  the  second 
trial  the  decision  model  indicated  the  student  was  "Not  Proficient"  and 
logically  should  be  given  remedial  or  additional  training.  The  sequence  is 
initiated  again  on  trial  three,  and  on  the  fourth  trial  of  that  sequence 
(sixth  trial  given)  the  model  decision  was  "Proficient." 
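
The decision sequence described for the Running Takeoff task can be traced
with the sketch below. It assumes the standard Wald log-likelihood form of
the test with P1 = .50, P2 = .92, and alpha = beta = .10, and it assumes the
accumulated evidence is reset after every decision. The protocol "11PPPP"
is a reconstruction of the first six trials from the description (two "1"
grades followed by four "P" grades), not data read from figure 2.

```python
import math


def cates_decisions(protocol, p1, p2, alpha, beta):
    """Return a list of (trial_number, decision) for a string of grades.

    `protocol` is a string of "P" and "1" grades in the order given.
    The running log-likelihood ratio restarts at zero after each decision.
    """
    step_p = math.log(p2 / p1)              # evidence from a "P" trial
    step_1 = math.log((1 - p2) / (1 - p1))  # evidence from a "1" trial
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))

    decisions = []
    llr = 0.0
    for n, grade in enumerate(protocol, start=1):
        llr += step_p if grade == "P" else step_1
        if llr >= upper:
            decisions.append((n, "Proficient"))
            llr = 0.0
        elif llr <= lower:
            decisions.append((n, "Not Proficient"))
            llr = 0.0
    return decisions


print(cates_decisions("11PPPP", 0.50, 0.92, 0.10, 0.10))
# [(2, 'Not Proficient'), (6, 'Proficient')]
```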

Figure  3  shows  the  protocol  for  the  Free  Stream  Recovery  task.  Perhaps 
because  of  slower  acquisition  of  a  more  difficult  task,  two  decisions  were 
made  declaring  the  student  "Not  Proficient"  in  the  earlier  sessions  of  task 
exposure.  The  model  does  show  that  more  task  trials  were  required  before  a 
decision  could  be  made  about  proficiency.  This  can  be  attributed  to  increased 
task difficulty and variability of performance.

SECTION IV

PLANNING  FOR  IMPLEMENTATION 

The  role  of  sequential  sampling  decision  models  to  determine  aviation 
task  proficiency  must  be  operationally  explored  in  terms  of  feasibility  and 
subsequent  validity.  A  study  is  currently  underway  to  test  the  concept  at 
the East Coast SH-3 FRS, HS-1, NAS Jacksonville, Florida. The study is
broadly  planned  as  follows: 

1.  identify  a  syllabus  of  specific  training  tasks 

2.  establish  proficiency  decision  model  parameters  from  prior  data 
collected  at  HS-1 

3. train instructors to render performance judgments on task trials;
i.e., was performance a "1" or a "P"?

4.  collect  data  on  each  trainee's  task  performance  by  trial 

a.  The  current  decision  model  (unique  to  each  instructor) 
will  determine  when  to  terminate  training  the  task. 

b.  Instructors  and  training  managers  will  have  no  knowledge  of 
CATES  system  decisions  regarding  task  proficiency. 

5. compare the models analytically against the final performance criterion
(NATOPS flight evaluation performance).

6.  make  recommendations  as  to  feasibility. 

Assuming the results of the study are promising, it will be desirable to
look toward incorporating or designing a CMI system to which these models
are readily amenable. Semple, Cotton, and Sullivan (1980) have summarized
the advantages of a CMI system for aircrew training devices applicable to
all aspects of aircraft flight training. CMI systems compare a student's
training history with a standard training syllabus made up of lists of clearly
defined tasks. The "ideal" system assesses student performance on each task
and compares this performance with criteria of acceptable performance. This
comparison identifies the tasks that the student can perform and those that
the student still cannot perform. System software then composes an
individualized set of instructional tasks that may be trained in subsequent
training sessions or flights. Additional factors that may be considered in
system design include training asset availability and prediction of training
completion dates.
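The comparison step described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name, the data layout, the session size limit, and the third task name are hypothetical, and a real CMI system would also weigh training asset availability and completion dates.

```python
def compose_next_session(syllabus, decisions, max_tasks=3):
    """Select the next tasks to train from an ordered syllabus.

    syllabus: ordered list of task names.
    decisions: dict mapping a task name to its latest proficiency decision
               ("Proficient" or "Not Proficient"); undecided tasks are absent.
    max_tasks: hypothetical limit on tasks per session.
    """
    pending = [task for task in syllabus if decisions.get(task) != "Proficient"]
    return pending[:max_tasks]

# Task C is a placeholder name, not a task from the HS-1 syllabus.
syllabus = ["Running Takeoff", "Free Stream Recovery", "Task C (hypothetical)"]
decisions = {
    "Running Takeoff": "Proficient",
    "Free Stream Recovery": "Not Proficient",
}
print(compose_next_session(syllabus, decisions))
```

Tasks already judged proficient drop out of the list, while tasks judged not proficient or not yet attempted remain in syllabus order.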

All  the  virtues  of  a  well  conceived  CMI  system  are  contingent  upon  an 
acceptable,  workable  performance  assessment  schema.  Figure  4  is  a  functional 
flow  diagram  describing  the  CATES  system  to  be  operationally  developed  and 
tested  for  use  by  HS-1.  It  is  premature  to  assert  whether  CATES  will  be  a 
"stand  alone"  system  or  become  an  integral  subsystem  of  the  Aviation  Training 
Support System (ATSS) (Naval Weapons Center, 1978). In either event,
implementing the proficiency determination concept advanced in this report can
only be done efficiently with on-line computer support. The work of Ferguson
(1969) and Kalisch (1980) would have been virtually impossible without
on-line computer support. Also planned are future efforts to determine the
range of applicability to other FRS settings.

Figure 4. Functional Flow Diagram of CATES System (Computer Aided Training
Evaluation and Scheduling System)

POST  NOTE 

In summary, this report has shown the variability of flight task
performance and the difficulty encountered in making accurate proficiency
determinations. The CATES system has been introduced as a method to formalize
and quantify the parameters of the decision process used in making these
determinations, thereby achieving a measure of control. Effort is underway to
operationally test the CATES system concerning feasibility, validity, and
range of applicability. This report is a prelude to that effort.

WALD BINOMIAL PROBABILITY RATIO TEST


The  Wald  binomial  probability  ratio  test  was  developed  by  Wald  (1947)  as 
a  means  of  making  statistical  decisions  using  as  limited  a  sample  as  possible. 
The  procedure  involves  the  consideration  of  two  hypotheses: 


$$H_0: P \le P_1$$

and

$$H_1: P \ge P_2$$

where P is the proportion of nondefectives in the collection under
consideration, P1 is the minimum proportion of nondefectives at or below which
the collection is rejected, and P2 is the desired proportion of nondefectives,
at or above which the collection is accepted. Since a simple hypothesis is
being tested against a simple alternative, the basis for deciding between H0
and H1 may be tested using the likelihood ratio:

$$\frac{P_{2n}}{P_{1n}} = \frac{P_2^{d_n}\,(1-P_2)^{n-d_n}}{P_1^{d_n}\,(1-P_1)^{n-d_n}}$$


Where: P1 = Minimum proportion of nondefectives at or below which the
collection is rejected.

P2 = Desirable proportion of nondefectives at or above which the
collection is accepted.

n = Total items in collection.

dn = Total nondefectives in collection.


The sequential testing procedure provides for a postponement region
based on prescribed values of alpha (α) and beta (β) that approximate the
two types of errors found in the statistical decision process. To test the
hypothesis H0: P = P1, calculate the likelihood ratio and proceed as follows:

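1. if $\frac{P_{2n}}{P_{1n}} \le \frac{\beta}{1-\alpha}$, accept $H_0$

2. if $\frac{P_{2n}}{P_{1n}} \ge \frac{1-\beta}{\alpha}$, accept $H_1$

3. if $\frac{\beta}{1-\alpha} < \frac{P_{2n}}{P_{1n}} < \frac{1-\beta}{\alpha}$, take an additional observation.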
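The three-way rule above can be written directly in terms of the likelihood ratio. The following Python sketch is illustrative only: the function names and the example values (P1 = 0.5, P2 = 0.9, alpha = beta = 0.1) are assumptions, not values taken from this report.

```python
def likelihood_ratio(dn, n, p1, p2):
    """Ratio of the binomial likelihoods under P2 and P1 for dn successes in n."""
    numerator = p2 ** dn * (1 - p2) ** (n - dn)
    denominator = p1 ** dn * (1 - p1) ** (n - dn)
    return numerator / denominator

def wald_decision(dn, n, p1, p2, alpha, beta):
    """Apply the three-way rule: accept H0, accept H1, or keep sampling."""
    ratio = likelihood_ratio(dn, n, p1, p2)
    if ratio <= beta / (1 - alpha):
        return "accept H0"
    if ratio >= (1 - beta) / alpha:
        return "accept H1"
    return "take an additional observation"

# Hypothetical parameters: thresholds are 0.1/0.9 (lower) and 0.9/0.1 (upper).
for dn, n in ((0, 2), (3, 3), (4, 4)):
    print(dn, n, wald_decision(dn, n, p1=0.5, p2=0.9, alpha=0.10, beta=0.10))
```

With these assumed values, no nondefectives in two observations accepts H0, three in three is still undecided, and four in four accepts H1.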


These  three  decisions  relate  well  to  the  task  proficiency  problem.  We 
may  use  the  following  rules: 


1. Accept the hypothesis that the grade of P is accumulated in lower
proportions than acceptable performance would indicate.

2. Reject the hypothesis that the grade of P is accumulated in lower
proportions than acceptable performance would indicate. By rejecting this
hypothesis, an alternative hypothesis is accepted that the grade of P is
accumulated in proportions equal to or greater than desired performance.

3. Continue training by taking an additional trial(s); a decision
cannot be made with specified confidence.


The  following  equations  are  used  to  calculate  the  decision  regions  of 
the  sequential  sampling  decision  model. 


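$$d_n \le \frac{\log\frac{\beta}{1-\alpha}}{\log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}} + n\,\frac{\log\frac{1-P_1}{1-P_2}}{\log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}}$$

$$d_n \ge \frac{\log\frac{1-\beta}{\alpha}}{\log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}} + n\,\frac{\log\frac{1-P_1}{1-P_2}}{\log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}}$$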


Where: dn = Accumulation of trials graded as "P" in the sequence

n = Total trials presented in the sequence

P1 = Lowest acceptable proportion of proficient trials (P) required
to pass the NATOPS flight evaluation with a grade of "Qualified."

P2 = Proportion of proficient trials (P) that represent desirable
performance on the NATOPS flight evaluation.

Alpha (α) = The probability of making a type I error (deciding a student is
proficient when in fact he is not proficient).

Beta (β) = The probability of making a type II error (deciding a student
is not proficient when in fact he is proficient).
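For readers who want the connecting step, the two decision-region equations follow from the likelihood-ratio rules by taking logarithms. The derivation below is a sketch; the shorthand D for the common denominator is introduced here for convenience and does not appear in the original, and the direction of the inequality assumes P2 > P1 so that D is positive.

```latex
\begin{align*}
\log\frac{P_{2n}}{P_{1n}}
  &= d_n \log\frac{P_2}{P_1} + (n - d_n)\log\frac{1-P_2}{1-P_1} \\
  &= d_n \left[\log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}\right]
     - n \log\frac{1-P_1}{1-P_2} \\
  &= d_n D - n \log\frac{1-P_1}{1-P_2},
  \qquad D = \log\frac{P_2}{P_1} + \log\frac{1-P_1}{1-P_2}.
\end{align*}
% Requiring the log ratio to be at or below log(beta / (1 - alpha)) and dividing by D:
\[
d_n \le \frac{\log\frac{\beta}{1-\alpha}}{D} + n\,\frac{\log\frac{1-P_1}{1-P_2}}{D}.
\]
% Requiring it to be at or above log((1 - beta) / alpha) gives the upper line:
\[
d_n \ge \frac{\log\frac{1-\beta}{\alpha}}{D} + n\,\frac{\log\frac{1-P_1}{1-P_2}}{D}.
\]
```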

The first term of the two equations will determine the intercepts of the
two linear equations. The width between these intercepts is determined
largely by values selected for alpha (α) and beta (β). The width between the
intercepts translates into a region of uncertainty; thus as lower values of
alpha (α) and beta (β) are selected this region of uncertainty increases.

The second term of the equations determines the slope of each linear
equation. Since the second term is the same for both equations, the two
lines are parallel. Values of P1 and P2 as well as differences between P1
and P2 affect the slope of the lines. This is easily translated into task
difficulty. As P2 values increase, indicating easier tasks, the slope becomes
steeper. This in turn results in fewer trials required in the sample to
reach a decision.
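The effect of the parameters on the intercepts and slope can be checked numerically. The Python sketch below is illustrative: the function name and every parameter value are assumptions chosen for the demonstration, not values used at HS-1.

```python
import math

def line_parameters(p1, p2, alpha, beta):
    """Return (lower_intercept, upper_intercept, slope) of the decision lines."""
    denom = math.log(p2 / p1) + math.log((1 - p1) / (1 - p2))
    lower = math.log(beta / (1 - alpha)) / denom
    upper = math.log((1 - beta) / alpha) / denom
    slope = math.log((1 - p1) / (1 - p2)) / denom
    return lower, upper, slope

# Smaller error rates widen the gap between the two intercepts.
for err in (0.20, 0.10, 0.05):
    lower, upper, _ = line_parameters(0.5, 0.9, err, err)
    print(f"alpha=beta={err:.2f}: width={upper - lower:.3f}")

# A larger P2 (an easier task) steepens the common slope.
for p2 in (0.7, 0.8, 0.9):
    _, _, slope = line_parameters(0.5, p2, 0.10, 0.10)
    print(f"P2={p2:.1f}: slope={slope:.3f}")
```

Both loops hold everything else fixed, so each printed trend isolates one of the two effects described above.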

As differences in P1 and P2 increase, the slope also becomes steeper and
the uncertainty region decreases. This is consonant with rational decision
making. When the difference between the lower level of proficiency and upper
level of proficiency is great, it is easier to determine at which proficiency
level the pilot trainee is performing. The concept of differences in P1
and P2 is analogous to the concept of effect size in statistically testing
the difference between the means of two groups. In such statistical testing,
when alpha (α) and beta (β) remain constant, the number of observations
required to detect a significant difference may be reduced as the anticipated
effect size increases (Kalisch, 1980).
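One way to see the effect of the gap between P1 and P2 is to count the fewest trials that can produce a decision. The Python sketch below does this for an unbroken run of trials; the function name and the parameter pairs are hypothetical and serve only to illustrate the trend.

```python
import math

def min_trials(p1, p2, alpha, beta):
    """Fewest trials needed to reach each decision on an unbroken run.

    Returns (to_proficient, to_not_proficient): the trials needed when every
    trial is graded "P", and the trials needed when no trial is graded "P".
    """
    denom = math.log(p2 / p1) + math.log((1 - p1) / (1 - p2))
    lower = math.log(beta / (1 - alpha)) / denom
    upper = math.log((1 - beta) / alpha) / denom
    slope = math.log((1 - p1) / (1 - p2)) / denom
    # All "P": dn = n reaches the upper line once n >= upper / (1 - slope).
    to_proficient = math.ceil(upper / (1 - slope))
    # No "P": dn = 0 reaches the lower line once n >= -lower / slope.
    to_not_proficient = math.ceil(-lower / slope)
    return to_proficient, to_not_proficient

# Hypothetical narrow gap versus wide gap, with alpha = beta = 0.10.
for p1, p2 in ((0.6, 0.8), (0.5, 0.9)):
    print(p1, p2, min_trials(p1, p2, 0.10, 0.10))
```

With these assumed values the narrow gap needs at least eight consecutive "P" grades (or four consecutive failures) to reach a decision, while the wide gap needs four (or two).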
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